1. INTRODUCTION

Large repositories for multimedia and image content have developed due to the growing usage of 
digital computers, storage technologies, and digital multimedia in recent years. This enormous volume of 
multimedia data is utilized in various industries, including digital forensics, electronic games, archaeology, 
video, satellite data and still image repositories, and medical treatment. This rapid growth has generated a 
continuous need for image retrieval systems that operate on a large scale [1]. For large image databases, the
traditional text-based image retrieval approach is ineffective. Retrieving images based on text has several
drawbacks. Adding labels to individual images in huge databases is time-consuming. Label text is language
dependent, so it is only appropriate for one language at a time. In addition, different users can assign
different labels to the same image. These drawbacks can be avoided by retrieving images based on their
content. This type of image retrieval is known as content-based image retrieval (CBIR) [2].

CBIR has been a popular topic in the multimedia research community since the early 1990s [3]. Its
main block diagram is shown in Figure 1. CBIR is an important technology in image processing and
computer vision. CBIR applications have been developed for various uses, including object recognition,
geographic information systems, architectural design, remote sensing [4], surveillance systems, and medical
image retrieval [5]. CBIR is a well-defined image search and retrieval technique that uses the visual content of
images to find and retrieve images from huge data collections [6]. CBIR searches on low-level
features such as texture, shape, and color [7]. This set of low-level features forms a feature vector, which
describes the content of each image in the image database. Image retrieval is then based on the similarity of
image contents: the similarity between the query feature vector and the feature vectors of the dataset is used
to sort the list of matching images [8]. The features used by CBIR may be divided into two groups: global
feature descriptors and local feature descriptors. Global features include color [9], shape [10], and texture [11].
Local features include local binary pattern (LBP) [12], oriented FAST and rotated BRIEF (ORB) [13],
speeded-up robust features (SURF) [14], scale-invariant feature transform (SIFT) [15], and
histogram of oriented gradients (HOG) [16].

Local feature descriptors describe individual image patches, whereas global feature descriptors describe the
whole image. The advantage of global feature descriptors is their quick computation, but the disadvantage is
their poor precision: they frequently fail to extract significant visual characteristics from an image.
Local feature descriptors are more accurate than global feature descriptors because they represent the image
using features calculated from image patches. The disadvantage of local features is that they result in a
large feature space for large image databases [17]. The feature vectors and similarity measures have the
biggest effects on the retrieval performance of a CBIR system. There is always a semantic gap between
high-level human perception and the low-level image pixels that systems capture. In light of the recent
success of deep learning techniques, particularly convolutional neural networks (CNN), in solving computer
vision problems, researchers have applied them to address this gap and enhance CBIR performance [18].

Srivastava and Khare [19] presented a technique for CBIR that combines local and global features.
Geometric moments extract global features, while SIFT descriptors extract local features. SIFT and moments
are combined to find visually similar images. On the Corel 1K dataset, the method achieves an average
precision of 0.3981. Mehmood et al. [20] combined the SURF and HOG image features. After the final features
were extracted using the bag-of-visual-words (BoVW) model, they used the Euclidean distance to evaluate the
similarity between the query image and the database images. The average precision is 0.8061 on the Corel 1K
dataset. These methods have two main drawbacks: first, they require a lot of time, and second, finding the
most salient points is not always easy.

Nazir et al. [21] suggested a new CBIR system that combines local and global features to handle
low-level information. A color histogram (CH) is used to extract color information. The edge histogram
descriptor (EDH) and discrete wavelet transform (DWT) extract texture features. Based on the experimental
results, the suggested method achieves an average precision of 0.735 on the Corel 1K dataset.

Pardede et al. [22] suggested a CBIR method that employed a deep CNN for feature extraction from the
fully connected layers FC1 and FC2. The feature extractor uses the fully connected feature vectors FV.FC1
and FV.FC2 to extract features from each image, and the performance of the deep CNN for CBIR tasks is
compared across three classifiers: softmax, support vector machine (SVM), and extreme gradient boosting
(XGBoost). A deep CNN model was built based on the suggested neural network structure. The experimental
results suggest that, with the XGBoost classifier, the features extracted by the deep CNN can improve CBIR
performance, and that the best feature extractor is FV.FC2. The precision on the Wang dataset (Corel 1K)
is 0.69.

Öztürk [23] proposed a CBIR framework in which a dictionary learning method addresses the issue of
training with a small amount of labelled data. Dictionary learning (DL) alone cannot produce reliable features
for the retrieval task, particularly when there is a complicated background. To address both the issue of
identifying objects and dealing with complicated backgrounds, a DL technique utilizing the feature
representation capabilities of a CNN (ResNet-50) is implemented in this system. When 10 images are retrieved,
the mean average precision (mAP) for the modified Corel dataset is 0.855. Öztürk [24] provided a framework for
content-based medical image retrieval (CBMIR) based on high-level deep features. The insufficient number of
images is the main problem here. To address this issue, a class-driven retrieval strategy is suggested.
Different hash code lengths are produced using feature reduction methods, and their performances are
evaluated. Experiments with the National Electrical Manufacturers Association Magnetic Resonance Imaging
(NEMA MRI) and the National Electrical Manufacturers Association Computed Tomography (NEMA CT) datasets show
that the framework outperforms the existing methods in the literature. Desai et al. [25] suggested a
deep learning architecture for fast image retrieval based on a CNN and an SVM. The SVM is used for
classification to reduce the time required to retrieve the results. VGG16 is used to extract features.
It has 13 convolutional layers, 3 fully connected layers, and, as the last layer, a softmax classifier. The
average precision for retrieving 10 images from the Corel dataset is 0.8361. VGG16 has 138 million parameters,
which is a drawback since it can lead to the exploding gradient problem.

The main contributions of this paper can be summarized as follows:
- This paper employs a CNN to extract deep features from images in order to bridge the semantic gap between
high-level human perception and the low-level image pixels that computers capture, and thereby retrieve the
most relevant images.

- The goal is to determine an appropriate CNN architecture while focusing on selecting the best
hyperparameter values.

- Two experiments are conducted: one using the CNN model with max pooling and one using the CNN model with
average pooling.

- In the similarity measurement phase, two distance measures, Euclidean and city block (Manhattan), are
implemented to find the best in terms of mAP.


[Figure 1 block labels: dataset; feature extraction process; vectors of dataset features; query image; query image features; similarity measurement; retrieved images]

Figure 1. Block diagram of content-based image retrieval 


The paper is organized as follows: Section 2 presents the proposed methodology, and Section 3 describes
the similarity measurement. Section 4 presents the experimental results and discussion, and Section 5
concludes the paper.


2. METHOD 

In this paper, CNN is used for feature extraction since it is efficient at closing the semantic gap 
between high-level human perception and low-level machine features, as well as at finding the most relevant 
images and improving retrieval performance. The Corel 1K [26] database was utilized to validate the results. 
It is a 1,000-image database collection. These images are organized into 10 categories, each of which has 100 
images. A block diagram of the proposed method is illustrated in Figure 2. 


[Figure 2 block labels: dataset images and query image; resize image; extract features using the proposed CNN model; feature vectors of dataset images; query feature vector; similarity distance (Euclidean, Manhattan); image retrieval]


Figure 2. Block diagram of the proposed method 


2.1. Feature extraction 
CNN is a type of neural network model that uses several large network layers [27]. CNN has gained 
popularity in various image processing applications, including object recognition [28] and picture 
classification [29], and has produced promising results. CNNs are increasingly used in various
image-processing tasks, such as object classification, face recognition, and gesture identification. According
to earlier research, it is possible to input an image directly into a CNN and use the learned features for
image classification [30]. The basic CNN architecture is composed of convolutional layers, pooling layers,
fully connected (FC) layers, softmax layers, and non-linear activation functions such as the rectified linear
unit (ReLU) [31], [32]. The forward pass begins with a convolutional layer, where an activation map is produced
by computing the dot product between the filter and each region of the input volume that it covers. Next, the
ReLU function sets negative values to zero, and pooling downsamples the feature maps. This sequence can be
repeated any number of times. The final step of the forward pass is the fully connected layer, where the
output is produced in vector form. To determine which class the output belongs to, the softmax values and
error values can be computed for the training data from the output of the fully connected layer. CNN works on
image volumes, so the input image can be thought of as the input volume. Width, height, and depth are
the three dimensions that make up the volume; they are denoted W, H, and D [33].
The filter slides over the input volume, beginning at the top left, moving to the top right, and then
continuing row by row to the bottom. The number of pixels the filter moves at each step is called the stride.
Since ReLU simply sets negative values to 0 and leaves other values unchanged, it is a quick activation
function [34], as seen in (1). The hidden layer's size is huge after the convolution procedure. It is
typical to place a pooling or sub-sampling layer right after a convolutional layer to decrease computational
complexity. Max and average pooling are two commonly used types of pooling [35]. Let $y = (y_{ij})$
represent the matrix in a pooling window.


$\mathrm{ReLU}(Y) = \max(0, Y)$ (1)

Using the maximum element in y as the output is known as max pooling, as seen in (2).

$x = \max(y)$ (2)

Taking the average of all the element values ($y_{ij}$) is known as average pooling [18], as seen in (3).

$x = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} y_{ij}$ (3)

M and N represent the numbers of rows and columns in the pooled matrix. This paper conducted two
experiments, the first with max pooling and the second with average pooling. The proposed CNN architecture
comprises 19 layers, as shown in Figure 3, and is used to extract the dataset feature vectors. The model
contains six convolutional layers and six batch normalization layers to normalize the data. Each pair of
convolutional layers is followed by a pooling layer, and the model also includes two dropout layers for
regularization and two fully connected layers. The original images, which are 384x256 or 256x384 pixels in
size, are resized to 64x64 pixels before they are fed into the CNN model. Across the layers of the CNN model,
the number of filters increases from 32 to 128 so that features can be found. The internal architecture of the
CNN model used to train the CBIR is shown in Table 1. The hyperparameters used in this paper are listed in
Table 2. Adam is used as the optimizer.
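
As an illustration of (1) to (3), the following minimal Python sketch applies ReLU to a small feature map and then pools it with non-overlapping 2x2 windows. The function names (relu, pool2d) and the 4x4 example values are hypothetical and are not taken from the paper's implementation; a stride equal to the window size is assumed.

```python
from typing import List

Matrix = List[List[float]]


def relu(value: float) -> float:
    """Equation (1): negative inputs become 0, other inputs pass through."""
    return max(0.0, value)


def pool2d(feature_map: Matrix, size: int = 2, mode: str = "max") -> Matrix:
    """Pool non-overlapping size x size windows (stride equal to the window size)."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled: Matrix = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                out_row.append(max(window))                # equation (2)
            else:
                out_row.append(sum(window) / len(window))  # equation (3)
        pooled.append(out_row)
    return pooled


# Hypothetical 4x4 feature map used only for illustration.
feature_map = [
    [1.0, -2.0, 3.0, 0.0],
    [4.0, 5.0, -6.0, 2.0],
    [-1.0, 0.0, 2.0, 2.0],
    [3.0, -3.0, 4.0, 8.0],
]
activated = [[relu(v) for v in row] for row in feature_map]
max_pooled = pool2d(activated, mode="max")
avg_pooled = pool2d(activated, mode="average")
print(max_pooled)  # [[5.0, 3.0], [3.0, 8.0]]
print(avg_pooled)  # [[2.5, 1.25], [0.75, 4.0]]
```

Max pooling keeps only the strongest response in each window, whereas average pooling blends all four values, which is the difference the two experiments in this paper compare.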


[Figure 3 legend: batch normalization, pooling, dropout, FC1, FC2]


Figure 3. Proposed CNN-model architecture


Table 1. The internal architecture of the CNN model


Layer (type)            Input shape      No. of filters   Filter size   Pooling window   Output shape     Parameters
conv2d (Conv2D)         (64, 64, 3)      32               3 x 3         -                (64, 64, 32)     896
batch_normalization     (64, 64, 32)     -                -             -                (64, 64, 32)     128
conv2d_1 (Conv2D)       (64, 64, 32)     32               3 x 3         -                (64, 64, 32)     9248
batch_normalization_1   (64, 64, 32)     -                -             -                (64, 64, 32)     128
average_pooling2d       (64, 64, 32)     -                -             2 x 2            (32, 32, 32)     0
dropout (rate 0.3)      -                -                -             -                -                0
conv2d_2 (Conv2D)       (32, 32, 32)     64               3 x 3         -                (32, 32, 64)     18496
batch_normalization_2   (32, 32, 64)     -                -             -                (32, 32, 64)     256
conv2d_3 (Conv2D)       (32, 32, 64)     64               3 x 3         -                (32, 32, 64)     36928
batch_normalization_3   (32, 32, 64)     -                -             -                (32, 32, 64)     256
average_pooling2d_1     (32, 32, 64)     -                -             2 x 2            (16, 16, 64)     0
conv2d_4 (Conv2D)       (16, 16, 64)     128              3 x 3         -                (16, 16, 128)    73856
batch_normalization_4   (16, 16, 128)    -                -             -                (16, 16, 128)    512
conv2d_5 (Conv2D)       (16, 16, 128)    128              3 x 3         -                (16, 16, 128)    147584
batch_normalization_5   (16, 16, 128)    -                -             -                (16, 16, 128)    512
average_pooling2d_2     (16, 16, 128)    -                -             2 x 2            (8, 8, 128)      0
dropout_1 (rate 0.2)    -                -                -             -                -                0
flatten                 (8, 8, 128)      -                -             -                8192             0
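
The parameter counts and output shapes in Table 1 can be checked with simple arithmetic. The sketch below is only a consistency check and not the authors' code. It assumes 3 x 3 filters with one bias per filter, padding that preserves the spatial size of each convolution output, four stored values per channel in each batch normalization layer, and 2 x 2 pooling that halves the width and height. The helper names are hypothetical.

```python
def conv2d_params(kernel: int, in_channels: int, filters: int) -> int:
    """Weights (kernel * kernel * in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def batch_norm_params(channels: int) -> int:
    """Assumed four values per channel: scale, shift, moving mean, moving variance."""
    return 4 * channels


height = width = 64   # resized input image
channels = 3          # RGB input
conv_counts = []
bn_counts = []
for filters in (32, 64, 128):          # filters per block, as listed in Table 1
    for _ in range(2):                 # two convolutional layers per block
        conv_counts.append(conv2d_params(3, channels, filters))
        bn_counts.append(batch_norm_params(filters))
        channels = filters
    height, width = height // 2, width // 2   # 2 x 2 pooling halves each dimension

flattened = height * width * channels
print(conv_counts)  # [896, 9248, 18496, 36928, 73856, 147584]
print(bn_counts)    # [128, 128, 256, 256, 512, 512]
print(flattened)    # 8192
```

Under these assumptions the computed values match the Parameters column and the flatten output size of 8192 reported in Table 1.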


Table 2. Hyperparameter values


Hyperparameters Value 
Split data 900 train, 100 queries 
Dropout 0.3, 0.2 
Batch size 128 
Learning rate 0.001 
Num. of epochs 500 


2.2. Image retrieval 

The distance between the feature vector obtained from the query and each feature vector from the
training dataset is computed using two measures: Euclidean and Manhattan. As indicated in Algorithm 1,
the images with the shortest distance are returned. Evaluation metrics are used to
assess the system's performance.


Algorithm 1: Image retrieval
Input: Feature vector of size 128 x 1 x 1
Output: Similar images and average precision
- Label the feature vectors of images for all classes.
- Convert nominal classes to numeric values, e.g., the bus is 3 and the flower is 6.
- Divide the dataset into 900 training and 100 testing images.
- Compute the distance between the feature vectors of all testing and training images and retrieve
the images with the smallest distance.
- Calculate the mean average precision for all testing images (queries).
- Display the most similar images for the query image.
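
The ranking step of Algorithm 1 can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the function names and the three-dimensional toy vectors are hypothetical (the paper uses 128-dimensional features), and the Euclidean distance is taken from Python's standard math.dist. The labels 3 and 6 follow the bus and flower example in Algorithm 1.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def retrieve(query: Vector, database: List[Tuple[int, Vector]], top_k: int) -> List[int]:
    """Return the class labels of the top_k database vectors closest to the query."""
    ranked = sorted(database, key=lambda item: math.dist(query, item[1]))
    return [label for label, _ in ranked[:top_k]]


def precision(query_label: int, retrieved_labels: List[int]) -> float:
    """Fraction of retrieved images whose label matches the query label."""
    relevant = sum(1 for label in retrieved_labels if label == query_label)
    return relevant / len(retrieved_labels)


# Hypothetical training set: (numeric class label, feature vector).
training = [
    (3, [0.9, 0.1, 0.0]),
    (3, [0.8, 0.2, 0.1]),
    (6, [0.1, 0.9, 0.2]),
    (6, [0.0, 1.0, 0.1]),
    (3, [0.7, 0.0, 0.3]),
]
query_label, query_vector = 3, [1.0, 0.0, 0.0]

retrieved = retrieve(query_vector, training, top_k=4)
query_precision = precision(query_label, retrieved)
print(retrieved)        # [3, 3, 3, 6]
print(query_precision)  # 0.75
```

Sorting the whole training set is adequate for 900 images; a larger database would likely call for an approximate nearest-neighbor index, which is outside the scope of this sketch.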


3. SIMILARITY MEASUREMENT

The Euclidean and Manhattan distances measure the relationship between the feature vectors of the
query images and the feature vectors of the training dataset images. The dataset image whose feature vector
has the smallest distance to the query feature vector is considered the most similar to the query. The
Euclidean and Manhattan distances are given in (4) and (5), respectively, where $m$ represents the size of the
feature vector, and $Fv_{db}$ and $Fv_{q}$ are the feature vectors for the dataset and query images,
respectively.


$D(q, db) = \sqrt{\sum_{i=1}^{m} \left(Fv_{db}(i) - Fv_{q}(i)\right)^{2}}$ (4)


$D(q, db) = \sum_{i=1}^{m} \left|Fv_{db}(i) - Fv_{q}(i)\right|$ (5)
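
A worked example may help make the two distance measures concrete. The sketch below writes the Euclidean and Manhattan distances as explicit sums over the feature vector elements; the function names and the three-element vectors are hypothetical and chosen so that the results are easy to verify by hand.

```python
import math
from typing import Sequence


def euclidean_distance(fv_db: Sequence[float], fv_q: Sequence[float]) -> float:
    """Square root of the summed squared element-wise differences."""
    if len(fv_db) != len(fv_q):
        raise ValueError("feature vectors must have the same size")
    return math.sqrt(sum((d - q) ** 2 for d, q in zip(fv_db, fv_q)))


def manhattan_distance(fv_db: Sequence[float], fv_q: Sequence[float]) -> float:
    """Sum of the absolute element-wise differences (city block distance)."""
    if len(fv_db) != len(fv_q):
        raise ValueError("feature vectors must have the same size")
    return sum(abs(d - q) for d, q in zip(fv_db, fv_q))


# Hypothetical vectors: the element-wise differences are 3, 4, and 0.
fv_q = [1.0, 2.0, 3.0]
fv_db = [4.0, 6.0, 3.0]
print(euclidean_distance(fv_db, fv_q))  # 5.0
print(manhattan_distance(fv_db, fv_q))  # 7.0
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of vectors, so the two measures can rank database images differently even though both reach zero only for identical vectors.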


4. RESULTS AND DISCUSSION 
4.1. Dataset 

The Corel 1K dataset [26] has 1,000 JPEG images with a size of 256 by 384 or 384 by 256 pixels. The
collection consists of 10 categories with 100 images each. The categories are Africa, horses, flowers,
beaches, buses, buildings, mountains, dinosaurs, elephants, and food. Figure 4 displays an example of
each category. Retrieval was made more challenging and robust by keeping all the training images in one folder
and all the testing images in another. This means a test folder with 100 images, each of which is a query image,
and a training folder with 900 images.


Figure 4. Examples of each category in the Corel 1K dataset 


4.2. Evaluation measurement 

Performance in CBIR is assessed using precision and mean average precision (mAP). Precision is the
number of relevant images retrieved divided by the total number of images retrieved. It shows a
system's ability to return only relevant images, as seen in (6). The mAP is the mean of the average precision
for all classes. For average precision (AP), see (7), and for mAP, see (8). Here, $P_i$ represents the precision
for the $i$-th image, $n$ is the number of images, $AP_k$ denotes the AP of class $k$, and $m$ denotes the
number of classes.


$P = \frac{\text{Number of relevant images retrieved}}{\text{Total number of images retrieved}}$ (6)

$AP = \frac{1}{n} \sum_{i=1}^{n} P_i$ (7)

$mAP = \frac{1}{m} \sum_{k=1}^{m} AP_k$ (8)
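
The relationship between (6), (7), and (8) can be illustrated with a small calculation. This is a sketch under the assumption that AP for a class is the mean of the precision values of that class's queries; the function names and the two-class toy values are hypothetical and are not results from the paper.

```python
from statistics import mean
from typing import Dict, List


def precision(relevant_retrieved: int, total_retrieved: int) -> float:
    """Equation (6): share of retrieved images that are relevant."""
    return relevant_retrieved / total_retrieved


def average_precision(precisions: List[float]) -> float:
    """Equation (7): mean precision over the queries of one class."""
    return mean(precisions)


def mean_average_precision(per_class: Dict[str, List[float]]) -> float:
    """Equation (8): mean of the per-class average precision values."""
    return mean(average_precision(values) for values in per_class.values())


# Hypothetical example: two classes, two queries each, 10 images retrieved per query.
per_class_precisions = {
    "bus": [precision(10, 10), precision(5, 10)],       # 1.0 and 0.5
    "flower": [precision(10, 10), precision(10, 10)],   # 1.0 and 1.0
}
ap_bus = average_precision(per_class_precisions["bus"])
ap_flower = average_precision(per_class_precisions["flower"])
map_value = mean_average_precision(per_class_precisions)
print(ap_bus, ap_flower, map_value)  # 0.75 1.0 0.875
```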


4.3. Experiment results 

Retrieval performance was assessed using precision. Higher precision means that the system returns more
relevant images than irrelevant ones. The precision of each query in every category was calculated, along
with the average precision per category. The Corel 1K image dataset is used because it contains
diverse images ranging from natural scenes to outdoor activities to various animals, making it suitable for
testing image retrieval systems. Two retrieval methods are evaluated. The first method is based on CNN with
max pooling, while the second is based on CNN with average pooling and various feature sizes. Several
hyperparameter settings were tried; Tables 3 and 4 show the results for a learning rate of 0.01 with 100 and
500 epochs, respectively. The hyperparameters in Table 2 were found to fit the architecture proposed in
this paper.


Table 3. Average precision results when the learning rate is 0.01 and epochs are 100

Feature vector   128
Africa           0.52
Beaches          0.36
Buildings        0.47
Bus              0.84
Dinosaurs        0.97
Elephants        0.59
Flowers          0.93
Horses           0.62
Mountains        0.78
Foods            0.42
mAP              0.649


Table 4. Average precision results when the learning rate is 0.01 and epochs are 500

Feature vector   Flatten (8192)   128     256
Africa           0.5              0.66    0.64
Beaches          0.59             0.42    0.41
Buildings        0.25             0.66    0.68
Bus              0.46             0.95    0.96
Dinosaurs        0.98             0.98    0.98
Elephants        0.66             0.83    0.8
Flowers          1                0.98    0.97
Horses           0.83             0.56    0.59
Mountains        0.69             0.71    0.68
Foods            0.48             0.61    0.57
mAP              0.644            0.736   0.729


The results depend on the nature of the images. Some classes have simple and distinct colors that
make it easier to distinguish objects from the background. Other classes have similar colors, so objects are
difficult to distinguish. In addition, max pooling selects the strongest pixel values and largely neglects the
weak ones, so it works as an edge detector. Average pooling sums a group of pixels and divides the result by
the number of elements in the pooling window, so it works as an image smoother. With average pooling, the bus,
buildings, dinosaurs, flowers, horses, and mountains classes have the highest average precision. With max
pooling, the bus, dinosaurs, flowers, horses, and mountains classes have the highest average precision. Table 5
shows the average precision for Euclidean distance. A feature size of 128 based on CNN with average pooling
achieves higher average precision than max pooling when using Manhattan distance, as seen in Table 6. The
best result, as shown in Tables 5 and 6, was achieved with Euclidean distance, a feature size of 256, and 10
retrieved images. Figure 5 compares the average precision measured on the Corel 1K dataset to state-of-the-art
traditional methods. The images retrieved for a query image using CNN are illustrated in Figure 6.


Content-based image retrieval based on corel dataset using deep learning (Rasha Qassim Hassan)


1860 ISSN: 2252-8938


Table 5. The average precision for each class with different feature vector sizes for average pooling and max 
pooling based on Euclidean distance 


Feature size                  Flatten (8,192)   128             256             512             1,000
Number of retrieved images    10      20        10      20      10      20      10      20      10      20

Average pooling
Africa                        0.71    0.66      0.9     0.88    0.91    0.89    0.76    0.71    0.73    0.71
Beaches                       0.63    0.59      0.65    0.65    0.65    0.65    0.78    0.75    0.77    0.74
Buildings                     0.86    0.83      0.95    0.94    0.95    0.95    0.9     0.9     0.9     0.9
Bus                           0.96    0.92      0.9     0.9     0.9     0.89    0.98    0.96    0.94    0.93
Dinosaurs                     1       1         1       1       1       1       1       1       1       1
Elephants                     0.38    0.35      0.75    0.75    0.75    0.73    0.75    0.74    0.75    0.74
Flowers                       1       1         1       1       1       1       1       1       1       1
Horses                        1       1         0.98    0.98    0.99    0.98    0.97    0.97    0.98    0.97
Mountains                     0.84    0.79      1       1       1       1       1       0.99    1       0.99
Foods                         0.49    0.45      0.65    0.66    0.65    0.66    0.53    0.52    0.54    0.53
mAP                           0.787   0.759     0.878   0.876   0.88    0.876   0.866   0.855   0.86    0.852

Max pooling
Africa                        0.71    0.61      0.79    0.79    0.8     0.8     0.76    0.72    0.75    0.72
Beaches                       0.55    0.53      0.5     0.5     0.51    0.52    0.59    0.56    0.58    0.56
Buildings                     0.71    0.66      0.75    0.73    0.8     0.73    0.81    0.82    0.8     0.81
Bus                           0.65    0.62      0.94    0.92    0.93    0.91    0.88    0.86    0.84    0.84
Dinosaurs                     1       1         1       1       1       1       1       1       1       1
Elephants                     0.7     0.65      0.88    0.86    0.88    0.86    0.78    0.78    0.81    0.81
Flowers                       1       1         1       1       1       1       1       1       1       1
Horses                        1       0.97      0.92    0.92    0.92    0.91    0.9     0.9     0.9     0.9
Mountains                     0.9     0.8       0.9     0.86    0.88    0.85    0.81    0.83    0.81    0.82
Foods                         0.43    0.42      0.68    0.63    0.64    0.62    0.32    0.33    0.31    0.32
mAP                           0.765   0.726     0.835   0.821   0.835   0.821   0.783   0.78    0.781   0.778


Table 6. The average precision for each class with different feature vector sizes for average pooling and max 
pooling based on Manhattan distance 


Feature size                  Flatten (8,192)   128             256             512             1,000
Number of retrieved images    10      20        10      20      10      20      10      20      10      20

Average pooling
Africa                        0.58    0.56      0.93    0.9     0.9     0.89    0.76    0.74    0.79    0.78
Beaches                       0.64    0.62      0.63    0.62    0.62    0.65    0.78    0.76    0.78    0.76
Buildings                     0.84    0.81      0.93    0.93    0.96    0.95    0.9     0.9     0.9     0.9
Bus                           0.92    0.88      0.89    0.88    0.9     0.89    0.93    0.93    0.91    0.91
Dinosaurs                     1       1         1       1       1       1       1       1       1       1
Elephants                     0.31    0.28      0.75    0.74    0.75    0.73    0.75    0.76    0.72    0.73
Flowers                       1       1         1       1       1       1       1       1       1       1
Horses                        1       0.99      1       1       1       0.98    0.98    0.98    0.98    0.97
Mountains                     0.8     0.74      1       0.99    1       1       1       0.98    1       0.99
Foods                         0.48    0.46      0.65    0.66    0.66    0.66    0.53    0.55    0.55    0.56
mAP                           0.756   0.733     0.879   0.872   0.879   0.876   0.864   0.86    0.863   0.86

Max pooling
Africa                        0.62    0.54      0.82    0.8     0.79    0.8     0.77    0.74    0.76    0.74
Beaches                       0.59    0.55      0.57    0.56    0.51    0.52    0.65    0.63    0.62    0.58
Buildings                     0.77    0.71      0.73    0.72    0.76    0.73    0.8     0.8     0.81    0.82
Bus                           0.63    0.59      0.91    0.9     0.93    0.91    0.85    0.86    0.84    0.84
Dinosaurs                     1       1         1       1       1       1       1       1       1       1
Elephants                     0.67    0.62      0.86    0.85    0.88    0.86    0.76    0.76    0.77    0.79
Flowers                       1       1         1       1       1       1       1       1       1       1
Horses                        0.97    0.96      0.93    0.93    0.93    0.91    0.91    0.9     0.9     0.9
Mountains                     0.89    0.79      0.9     0.87    0.87    0.85    0.84    0.84    0.82    0.83
Foods                         0.5     0.44      0.63    0.61    0.62    0.62    0.34    0.34    0.33    0.3
mAP                           0.764   0.721     0.836   0.823   0.827   0.821   0.792   0.787   0.786   0.78


Figure 5. Comparison with state-of-the-art traditional methods


Figure 6. Query images are on the left, while their retrieved images are on the right


4.4. Comparing the results with other papers employing deep learning 

This paper compares its results with two other papers that use deep learning and the same dataset, as
shown in Table 7. Pardede et al. [22] used a deep CNN for feature extraction from FC1 and FC2, with SVM,
softmax, and XGBoost as classifiers. Their deep CNN architecture used 3 Conv layers with ReLU, 3 max-pooling
layers, two FC layers, 3 dropout layers to minimize overfitting, and 1 batch normalization layer.
Desai et al. [25] extracted features using the VGG16 CNN model. VGG16 consists of 13 Conv layers, 5
max-pooling layers, and 3 FC layers. The feature vector created from these layers is then fed to the SVM,
which calculates the distance between each image in the dataset and the query image. The proposed CNN
model used six Conv layers, six batch normalization layers, three average pooling layers, and two FC layers
with the hyperparameters shown in Table 2. With this structure and these parameters, the proposed method
achieved a higher average precision than the related methods on the same dataset.


Table 7. Comparison with state-of-the-art methods

Class                      Authors in [22] (2019)   Authors in [25] (2021)   Proposed method using Avg. pool
No. of retrieved images    -                        10                       10
Africa                     0.63                     0.84                     0.91
Beaches                    0.53                     0.8406                   0.65
Buildings                  0.36                     0.8353                   0.95
Bus                        0.54                     0.8273                   0.9
Dinosaurs                  0.82                     0.832                    1
Elephants                  0.63                     0.8386                   0.75
Flowers                    1                        0.8308                   1
Horses                     0.93                     0.8413                   0.99
Mountains                  0.60                     0.838                    1
Foods                      0.88                     0.8373                   0.65
Avg.                       0.69                     0.8361                   0.88


5. CONCLUSION

This paper proposed a CBIR technique using a CNN for feature extraction. Two different pooling
layers are used to extract features: max and average. The performance of the proposed method was assessed
using precision and mAP. Euclidean and Manhattan distance measures are used to compute the distance
between the query and database image features. In experiments on the Corel 1K dataset, average pooling with a
feature size of 256 and Euclidean distance achieved an average precision of 0.88 for the first 10 retrieved
images. This is a significant improvement over previously proposed methods, such as CNN+SVM, SIFT, local and
global CH for a color feature, DWT+EDH for a texture feature, and BoVW with two feature extractors (HOG and
SURF). The proposed method is more accurate than the compared state-of-the-art approaches, which is a
promising result.